Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task

Authors

Nan Ding

Sebastian Goodman

Fei Sha

Radu Soricut

Abstract

We introduce a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identifying the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing “keywords” (or key-phrases) and their corresponding visual concepts. Instead, it requires an alignment between the representations of the two modalities that achieves a visually-grounded “understanding” of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-studied metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: an effective and extensible mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; human evaluation results on this dataset, informing a performance upper-bound; and several baseline and competitive learning approaches that illustrate the utility of the proposed task and dataset in advancing both image and language comprehension. We also show that, in a multi-task learning setting, the performance on the proposed task is positively correlated with the end-to-end task of image captioning.
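The evaluation metric named in the abstract (accuracy in picking the true caption among decoys) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the paper: the names `choose_caption`, `accuracy`, and `toy_score`, the representation of an image as a set of tags, and the example data. In particular, `toy_score` is a keyword-overlap placeholder, which is exactly the kind of shallow matching the task is designed to go beyond; a real system would supply a learned image-text scoring function in its place.

```python
from typing import Callable, Sequence, Set, Tuple

# Hypothetical example type: (image, candidate captions, index of the true caption).
# The image is stood in for by a set of tags purely for illustration.
Example = Tuple[Set[str], Sequence[str], int]


def choose_caption(image: Set[str],
                   candidates: Sequence[str],
                   score: Callable[[Set[str], str], float]) -> int:
    """Return the index of the candidate caption with the highest score."""
    return max(range(len(candidates)), key=lambda i: score(image, candidates[i]))


def accuracy(examples: Sequence[Example],
             score: Callable[[Set[str], str], float]) -> float:
    """Fraction of examples where the top-scoring candidate is the true caption."""
    correct = sum(
        choose_caption(image, candidates, score) == target
        for image, candidates, target in examples
    )
    return correct / len(examples)


def toy_score(image: Set[str], caption: str) -> float:
    """Placeholder scorer (assumption): number of image tags found in the caption."""
    return float(len(image & set(caption.split())))


# Made-up data, not drawn from the paper's dataset.
examples = [
    ({"dog", "frisbee"}, ["a dog catches a frisbee", "a cat sits on a sofa"], 0),
    ({"man", "horse"}, ["a woman rides a bike", "a man rides a horse"], 1),
]

print(accuracy(examples, toy_score))  # 1.0 on this toy data
```

Because the metric only needs the index of the top-scoring candidate, any scoring model can be swapped in without changing `accuracy`.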

Similar resources

Scaffolding Comprehension and Recall Gaps: Effects of Paratextual Advance Organizers

Although filling the gap in reading comprehension gained momentum with the rise of the top-down approach, Vygotsky’s concept of scaffolding and the dual code theory provided strong support for the use of paratext to enhance comprehension. Scaffolding is dependent on other-regulation, one type of which is object-regulation. From this vantage-point, various types of paratext can function as sou...

Assessing Reading Comprehension of Expository Text across Different Response Formats

This study investigated if different response formats (test methods) measure reading comprehension of expository text differently. The study was conducted with 48 semester 6 TESL students at a university in Selangor, Malaysia. These students received an expository passage having descriptive rhetorical structure followed by three response formats, namely, incomplete outline, graphic organizer, a...

Learning Answer-Entailing Structures for Machine Comprehension

Understanding open-domain text is one of the primary challenges in NLP. Machine comprehension evaluates the system’s ability to understand text through a series of question-answering tasks on short pieces of text such that the correct answer can be found only in the given text. For this task, we posit that there is a hidden (latent) structure that explains the relation between the question, cor...

Attention-Based Convolutional Neural Network for Machine Comprehension

Understanding open-domain text is one of the primary challenges in natural language processing (NLP). Machine comprehension benchmarks evaluate the system’s ability to understand text based on the text content only. In this work, we investigate machine comprehension on MCTest, a question answering (QA) benchmark. Prior work is mainly based on feature engineering approaches. We come up with a ne...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Journal:

CoRR

Volume: abs/1612.07833

Publication date: 2016
